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Overview of This Presentation 


Many good compiler technologies and research results 
have been developed for EPIC so far. This talk is not
intended to be a survey. 


Agenda 


— Part I: Current successes with product compilers
• Brief summary on two categories of program structures: loops
and acyclic code

— Part II: EPIC compiler challenges faced in the real world
• Main portion of this talk
• Hope to see new compiler technologies developed for EPIC
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Background 


• Itanium™ architecture, an EPIC architecture
— Many generations of Itanium architecture in the product pipeline
— First generation came out this year.
— Second generation (code named McKinley) coming next year
• Compilers
— Product compilers for Itanium architecture in various corporations
— Research EPIC compilers in many research labs and universities.
• Grow architecture and compiler together
— This is only the beginning of EPIC.
— Both architecture and compiler will improve over time
• Improved architecture implementations enable more effective
compiler optimizations and scheduling.
• Improved compilers realize the full architecture potential.
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What works well: loops 


• Effective software pipelining (e.g. modulo scheduling)
— Both counted and while loops
— Predication to eliminate control flow
— Control and data speculation
• Extensive set of loop optimizations
— Cache optimizations
— Using large register set
— Rotating registers
• Very effective data prefetching
— Making use of rotating registers
• Leading performance achieved


Keynote at the EPIC1 Workshop, MICRO34


Example: DAXPY

for i = 1, 1000
  y[i] = y[i] + a*x[i]
end for

• Original code: 2 FP ops, 2 loads, 1 store per iteration
• Compiled: 1 fma, 1 ldp, 1 st, 0.25 lfetches per iteration
• (Nearly) optimal: 9-cycle schedule per 8 iterations


.b1_5:
{ .mmi
  (p16) ldfpd       f43,f40=[r47]      //0:13 74
  (p16) ldfpd       f37,f34=[r49]      //0:13 86
        nop.i       0 ;;
} ... many bundles with load, fma sequence
{ .mfi
  (p18) stfd        [r14]=f67,64       //22:13 84
  (p17) fma.d       f66=f7,f49,f55     //13:13 83
  (p16) add         r46=64,r47         //4:13 107 B7
} ... few more stores
{ .mmi
  (p18) stfd        [r44]=f47          //25:13 76
  (p16) lfetch.nt1  [r9],64            //7:13 97
  (p16) add         r35=64,r36 ;;      //7:13 100 B7
}
{ .mfb
  (p18) stfd        [r3]=f33,64        //26:13 78
        nop.f       0
        br.ctop.sptk .b1_5 ;;          //18:13 113 B7
}
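
The 9-cycle figure is consistent with a simple resource bound on memory operations. The following is only a rough sketch: it assumes two memory ports per cycle (an assumption about the first-generation core, not a number taken from the slide), and the helper names `min_cycles` and `daxpy_mem_ops` are hypothetical.

```c
#include <assert.h>

/* Lower bound on schedule length imposed by one resource class:
 * ceil(ops / ports). */
static int min_cycles(int ops, int ports)
{
    assert(ports > 0);
    return (ops + ports - 1) / ports;
}

/* Memory operations issued by `iterations` iterations of the compiled
 * DAXPY loop, using the per-iteration counts quoted on the slide:
 * 1 load-pair, 1 store, 0.25 lfetch. */
static int daxpy_mem_ops(int iterations)
{
    int load_pairs = iterations;
    int stores     = iterations;
    int lfetches   = iterations / 4;   /* one lfetch per 4 iterations */
    return load_pairs + stores + lfetches;
}
```

Under that assumption, `min_cycles(daxpy_mem_ops(8), 2)` gives 9: eight iterations issue 18 memory operations, which cannot fit in fewer than 9 cycles on two ports.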
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What works well: acyclic code (1) 


• Compiler technologies designed to exploit ILP
— Wealth of techniques available on scheduling, control and data
speculation, predication, etc.
• Scheduler
— Acyclic regions with single entry/multiple exit
— Nested regions
— Cycle-based schedulers
— Simple code size awareness
• Control and data speculation
— Integrated naturally into the scheduler
— Machinery to generate recovery code
— Simple heuristics
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What works well: acyclic code (2) 


• Predication
— If-conversion to eliminate branches
— Predicate optimizations
— Predicate-aware register allocators
• Profile exploited for better performance
— Influenced many components of the compiler, such as the
scheduler, inliner, etc.
— Delivers excellent performance improvement for server
applications
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Scheduling Region (Example) 


if (fmt[i] == 'e' || fmt[i] == 'E') {
    total = 1;
    for (j = 0; j < cnt[i]; j++) {
        total *= cost[j];
        if (cost[j] == 0)
            return;
    }
}
else
    total = cost[0];
X *= total

[Region diagram: nodes v1=fmt[i] and v2=cost[0]]
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So any challenges left? 


• Yes.
• Now, let’s look at some performance data from a database
server
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Sample Server Performance Data 
on first generation Itanium™ processor 


[Figure: CPI cycle breakdown. Legible labels: Backend Pipeline Flush Cycles, Data Access Cycles, Scoreboard Dependency Cycles (2%), RSE Active Cycles (7%); two further slices read 2% and 6%.]
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Beyond ILP 


• Data and instruction accesses are the dominating issues for
now.
— Cache misses
— TLB misses
• Any ILP improvement will not be visible until the above
problems are solved.
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Top 10 EPIC Compiler Challenges 


1. Managing data caches/DTLB for acyclic code
2. Managing instruction cache/ITLB
3. More effective use of control speculation
4. More effective use of data speculation
5. Creative use of predication
6. Performance monitor feedback
7. Software pipelining extensions
8. Profiling
9. Whole program optimizations
10. Performance analysis tools
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We also have many EPIC Features! 


1. Control speculation
2. Data speculation
3. Data prefetching
4. Predication
5. Rotating registers
6. RSE
7. Cache hints
8. Instruction prefetch
9. Software pipelining support
10. Performance monitors
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Let’s push the compiler technology boundaries for EPIC!
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1. Managing Data Caches/DTLB

• Acyclic codes, function calls.
• Loops for linked lists, hash tables, sparse matrices
• Reducing or tolerating Dcache misses
— Data prefetching. Address computation too close to loads. Distance
too small to cover memory latency. Predict future address.
— Scheduling for cache misses
• e.g. increase distance between load and use
— Control speculation
— Data speculation
— Identifying missing accesses
• Reducing DTLB misses
— Data layout optimizations such as structure layout and splitting.
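
Structure splitting can be illustrated with a small sketch. The record layout below is entirely hypothetical (the field names and sizes are not from the talk); it only shows the idea of keeping the fields touched on the hot path together so that more records share a cache line or page.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: only `key` and `next` are touched on the hot
 * lookup path; the remaining fields are rarely used. */
struct record_unsplit {
    int    key;
    struct record_unsplit *next;
    char   name[64];
    double stats[8];
};

/* After splitting: cold fields move behind a pointer. */
struct record_cold {
    char   name[64];
    double stats[8];
};

struct record_hot {
    int    key;
    struct record_hot  *next;
    struct record_cold *cold;
};

/* Number of whole records that fit in one page of `page_bytes` bytes. */
static size_t records_per_page(size_t record_bytes, size_t page_bytes)
{
    assert(record_bytes > 0);
    return page_bytes / record_bytes;
}

/* Walk the hot chain only; cold data is never dereferenced here. */
static const struct record_hot *find(const struct record_hot *head, int key)
{
    while (head != NULL && head->key != key)
        head = head->next;
    return head;
}
```

With the hot part smaller than the original record, a traversal such as `find` touches fewer pages for the same number of nodes, which is the effect a DTLB-oriented layout optimization aims for.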
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Example: scalar code 


• Small distance between the function call to check() and the
load of *p_last in the loop.
• Not effective to prefetch above check().

void foo (ztype *status)
{
    int last = *p_last;     // cache hit for *p_last
    if (check (n, 1))       // *p_last evicted
        *status = Err;
    // high cache miss rate for *p_last
    for (n = *p_last; n > last; n--)    // average trip count 1
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Any Creative Use of EPIC Features? 


for j = 1, n
  r43 = address of D(2+k,j)
  r44 = address of C(1+k,j)
  for i = 1, m
    prefetch [r44]
    r42 = r44+16
    r44 = r43
    r43 = r42
    C(i,j) = D(i-1,j) + D(i+1,j)
  end_for
end_for

[Diagram: arrays C and D; register rotation r42 → r43 → r44]

Uses just one prefetch instruction for multiple arrays.
Achieved by using rotating registers, but this is not
acyclic code.
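
To see why a single prefetch covers both arrays, the register rotation can be emulated in plain C. This is only a sketch: the function name `prefetch_sequence` and the sample addresses are hypothetical, and the explicit copies stand in for what rotating registers would do by renaming.

```c
#include <assert.h>
#include <stddef.h>

/* Record the first `n` prefetch addresses produced by the three-register
 * rotation. `c_addr` and `d_addr` stand in for the starting addresses
 * of C and D; `stride` is the step added after each prefetch. */
static void prefetch_sequence(unsigned long c_addr, unsigned long d_addr,
                              unsigned long stride,
                              unsigned long *out, size_t n)
{
    unsigned long r42;
    unsigned long r43 = d_addr;
    unsigned long r44 = c_addr;

    for (size_t i = 0; i < n; i++) {
        out[i] = r44;         /* prefetch [r44] */
        r42 = r44 + stride;   /* advance the stream just prefetched */
        r44 = r43;            /* next iteration targets the other stream */
        r43 = r42;            /* advanced stream waits one iteration */
    }
}
```

Starting from addresses 1000 for C and 2000 for D with a stride of 16, the recorded sequence is 1000, 2000, 1016, 2016: the one prefetch alternates between the two arrays.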
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2. Managing Instruction Cache/TLB 


• Icache/ITLB misses for large applications
• Reducing Icache/ITLB misses
— Function layout (page faults, cache associativity)
— Function splitting
— Basic block ordering
— Scheduling for code size
— Instruction prefetching
• Streaming prefetch
• Hint prefetch
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Scheduling for Code Size 


• Utilize slack for code size.
— Arrange instructions to better utilize bundles
• Tradeoff between ILP and code size:
— E.g. compensation code.

// 3 bundles
0: f82 = f34 + f35, i1, i2;
1: r82 = r33 + r34, nop, nop;
5: st [r40] = f82, nop, nop

// 2 bundles
=> 0: f82 = f34 + f35, i1, i2;
   5: st [r40] = f82, r82 = r33 + r34, nop
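
The bundle counts in this example can be reproduced with a small sketch. It makes two simplifying assumptions that are not from the talk: a bundle never spans two cycles, and template restrictions are ignored. The name `count_bundles` is hypothetical.

```c
#include <assert.h>

enum { SLOTS_PER_BUNDLE = 3, MAX_CYCLE = 64 };

/* Count bundles for a schedule given as one issue cycle per instruction.
 * Each cycle needs ceil(instructions / 3) bundles; empty slots are nops. */
static int count_bundles(const int *cycle, int n)
{
    int per_cycle[MAX_CYCLE] = {0};
    int bundles = 0;

    for (int i = 0; i < n; i++) {
        assert(cycle[i] >= 0 && cycle[i] < MAX_CYCLE);
        per_cycle[cycle[i]]++;
    }
    for (int c = 0; c < MAX_CYCLE; c++)
        bundles += (per_cycle[c] + SLOTS_PER_BUNDLE - 1) / SLOTS_PER_BUNDLE;
    return bundles;
}
```

With issue cycles {0, 0, 0, 1, 5} the count is 3; delaying the add from cycle 1 to cycle 5, which its slack allows, gives {0, 0, 0, 5, 5} and a count of 2.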
intel. 
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Instruction Prefetching 


• Streaming and hint prefetch.

// hint
brp.many/few
... cycles ...

// streaming
(px) br.many T

[Diagram: code at target T, offsets T+0 ... 256, 384]

Issues:
• When to use streaming? What if the branch is not
taken?
• Is brp worth a new bundle? .few vs .many?


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3. Cost of Control Speculation 


• Understanding the cost of speculation on the right path
— Code size (like chk.s and recovery code)
— Register pressure
• Understanding the cost of speculation on the wrong path
— TLB lookup (DTC, DTLB and VHPT)
— Cache pollution
— Avoid cache miss penalty due to speculating uses
— Code size
• How do we get the benefit and avoid the cost?
— E.g. utilizing slack to avoid unnecessary speculation?
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Example: cost of speculation 


• The branch is likely not taken. Speculation helps.
• But cache miss latency is exposed if the branch is taken.

(p1) br els;;      // prob .3
r1 = ld
= r1
els:
nxt:

=>
r1 = ld.s          // cache miss!
= r1
(p1) br els;;
chk.s
els:
nxt:
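
The tradeoff can be made concrete with a toy expected-stall model. Everything here is a sketch under stated assumptions: only the branch probability of .3 comes from the slide, the miss latency and overlap values used below are hypothetical, and the model ignores recovery code, TLB effects and cache pollution.

```c
#include <assert.h>

/* Stall cycles left when a load is issued `overlap` cycles before the
 * point where its result is first needed. */
static int exposed(int miss_latency, int overlap)
{
    return miss_latency > overlap ? miss_latency - overlap : 0;
}

/* Expected stall in hundredths of a cycle. `taken_pct` is the
 * probability, in percent, that the branch to els is taken, i.e. that
 * the loaded value is not needed. */

/* Load and use both below the branch: only the fall-through path stalls. */
static int stall_no_speculation(int taken_pct, int miss_latency)
{
    return (100 - taken_pct) * miss_latency;
}

/* ld.s above the branch, use below it: the taken path never waits. */
static int stall_load_only(int taken_pct, int miss_latency, int overlap)
{
    return (100 - taken_pct) * exposed(miss_latency, overlap);
}

/* Use hoisted above the branch as well: an in-order pipeline waits at
 * the use on both paths. */
static int stall_load_and_use(int taken_pct, int miss_latency, int overlap)
{
    (void)taken_pct;
    return 100 * exposed(miss_latency, overlap);
}
```

With a hypothetical 20-cycle miss and 2 cycles of overlap, the three placements cost 14.00, 12.60 and 18.00 expected stall cycles: hoisting the use as well can make speculation worse than not speculating at all.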
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4. Data Speculation 


• More opportunities in real application environment
— without whole-program alias analysis
— However, one has to assume a best possible disambiguator.
• When to speculate?
— Determine low-probability aliases
• Modeling the cost of advanced load
— E.g. code size
• Scheduling with data speculation
— Utilizing slack?
• Optimizations with data speculation
— E.g. speculative loop invariant removal
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Example: utilize slack 


• No advantage to speculating ld [r45], even if it is likely
independent of st [r4]: the add still has to wait for ld [r44].

st [r4] = r3;;
r6 = ld [r44]      // likely dependent on st
r7 = ld [r45];;    // likely independent of st
= add r6, r7
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5. Creative Use of Predication 


• E.g. partial predication
— Eliminate branches.
— Better code bundling
— Hot short path not hurt by the cold long path

(pF) br els;;
i1, nop, br nxt;;      //prob 0.95
els: i2, nop, nop;;    //prob 0.05
i3, i4, i5;;
nxt:

=>
(pT) i1, (pF) i2, (pT) br nxt;;
i3, i4, i5;;
nxt:
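
The basic idea behind predication, replacing a branch by a predicate that selects between results, can be sketched in C. C has no predicate registers, so this only emulates the effect with arithmetic; the function names are hypothetical and the example shows full if-conversion of a tiny diamond, not the partial scheme above.

```c
#include <assert.h>

/* Branching form: one of two assignments runs, chosen by a branch. */
static int clamp_branch(int x, int limit)
{
    int r;
    if (x > limit)
        r = limit;
    else
        r = x;
    return r;
}

/* If-converted form: the compare produces a predicate p (0 or 1) and
 * both candidate values are combined without any branch. */
static int clamp_predicated(int x, int limit)
{
    int p = (x > limit);              /* plays the role of pT; 1 - p is pF */
    return p * limit + (1 - p) * x;   /* exactly one term is non-zero */
}
```

Both functions return the same value for every input; whether a compiler emits a branch for either one depends on the target and optimization level.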
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6. Performance Monitor Feedback 


• Current profile info:
— Branch probability
— Basic block count
— Value profiling
• Performance counters (low-cost)
— Cache miss
— TLB miss
— Branch probability
— Branch misprediction rate
• Design heuristics. How to use this effectively?
• Profiling user model
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7. Software Pipelining 


• Low trip count innermost loop.
• Creative use of predication and data speculation?

for (r = 0; r < rows; r++) {
    float s = I[r];
    for (e = row(r); e < row(r+1); e++) {
        s += M[e] * V[C[e]];
    }
    S[r] = s/D[r];
}

• low trip count inner loop => swp overhead not amortized
• loop bounds data dependent => complete unroll not possible
• outer loop trip counts large => inefficiency exaggerated
• accessed indirectly => cache latency exposed, prefetch hard
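
For reference, the loop can be written as a self-contained C function. This is a sketch under one assumption: the slide's `row(r)` is taken to be a lookup into an offset array `row[]` with `rows + 1` entries (a compressed-row layout). The function name `sparse_rows` is hypothetical; the array names mirror the slide.

```c
#include <assert.h>
#include <stddef.h>

/* For each row r: S[r] = (I[r] + sum over e of M[e] * V[C[e]]) / D[r],
 * where e ranges over row[r] .. row[r+1]-1. */
static void sparse_rows(size_t rows, const size_t *row,
                        const float *M, const size_t *C,
                        const float *V, const float *I,
                        const float *D, float *S)
{
    for (size_t r = 0; r < rows; r++) {
        float s = I[r];
        /* Trip count row[r+1] - row[r] is data dependent and often small. */
        for (size_t e = row[r]; e < row[r + 1]; e++)
            s += M[e] * V[C[e]];      /* indirect access through C[e] */
        S[r] = s / D[r];
    }
}
```

A matrix with rows of one or two non-zeros runs the inner loop only once or twice per outer iteration, which is the case where pipeline fill and drain overhead dominates.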
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8. Profiling 


• Profile-driven optimizations deliver large performance
gain for EPIC.
• Profile-driven optimizations under small source changes
— Incremental changes to profile?
— Incremental profile collection?
• Some applications have no profile
— E.g. shipped as a library. Different workloads for users.
— E.g. no representative profile
— Optimizations work well with and without profile
• Re-optimization of binaries with profile?
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9. Whole Program Analysis 


• Peak performance is often reached through whole-program
analysis.
• Whole-program analysis is not possible at compile time for
some applications.
— E.g. shipped as a library.
• Better ways to achieve the effect of whole-program analysis
— E.g. summary information on individual modules.
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10. Performance Analysis Tools 


• Current tools:
— Hot spots, but what if the profile is flat?
— IP-based performance info like cache miss rate. Local info. Scope
too small to drive optimizations.
— Aggregated performance data. Not enough to guide the design of
new optimizations.
• How far are we from the optimal?
— E.g. for loops, we know the min II.
— Theoretical models (like integer programming) for scheduling
— Data locality
— Code locality (e.g. block ordering)
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Other Opportunities? 


• Exploiting multithreading
— Optimizing multithreaded applications using IPF features
— Exploiting multi-threaded IPF architecture
• Speculative pre-computation
• Exploiting EPIC for JIT
• Dynamic optimizations
• Better debugging support for optimized code
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Summary 


• Early successes with compiling for EPIC.
— Effective loop scheduling and data locality techniques. Excellent
performance results.
— Scheduling and optimizations of acyclic code for ILP
• Data and instruction accesses in acyclic code are the main
challenges for now.
— Any ILP improvement will likely not be visible.
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